library(tidyverse)
library(tidymodels)
library(skimr)
library(plotly)
library(knitr)

This dataset allows us to explore feature engineering procedures, such as subsampling to address class imbalance (survivors far outnumber fatalities) and imputing missing data, particularly age, for expedition members with incomplete records.
Loading and exploring the dataset
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
skim(members)

| Name | members |
|---|---|
| Number of rows | 76519 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| logical | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| expedition_id | 0 | 1.00 | 9 | 9 | 0 | 10350 | 0 |
| member_id | 0 | 1.00 | 12 | 12 | 0 | 76518 | 0 |
| peak_id | 0 | 1.00 | 4 | 4 | 0 | 391 | 0 |
| peak_name | 15 | 1.00 | 4 | 25 | 0 | 390 | 0 |
| season | 0 | 1.00 | 6 | 7 | 0 | 5 | 0 |
| sex | 2 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| citizenship | 10 | 1.00 | 2 | 23 | 0 | 212 | 0 |
| expedition_role | 21 | 1.00 | 4 | 25 | 0 | 524 | 0 |
| death_cause | 75413 | 0.01 | 3 | 27 | 0 | 12 | 0 |
| injury_type | 74807 | 0.02 | 3 | 27 | 0 | 11 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| hired | 0 | 1 | 0.21 | FAL: 60788, TRU: 15731 |
| success | 0 | 1 | 0.38 | FAL: 47320, TRU: 29199 |
| solo | 0 | 1 | 0.00 | FAL: 76398, TRU: 121 |
| oxygen_used | 0 | 1 | 0.24 | FAL: 58286, TRU: 18233 |
| died | 0 | 1 | 0.01 | FAL: 75413, TRU: 1106 |
| injured | 0 | 1 | 0.02 | FAL: 74806, TRU: 1713 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2000.36 | 14.78 | 1905 | 1991 | 2004 | 2012 | 2019 | ▁▁▁▃▇ |
| age | 3497 | 0.95 | 37.33 | 10.40 | 7 | 29 | 36 | 44 | 85 | ▁▇▅▁▁ |
| highpoint_metres | 21833 | 0.71 | 7470.68 | 1040.06 | 3800 | 6700 | 7400 | 8400 | 8850 | ▁▁▆▃▇ |
| death_height_metres | 75451 | 0.01 | 6592.85 | 1308.19 | 400 | 5800 | 6600 | 7550 | 8830 | ▁▁▂▇▆ |
| injury_height_metres | 75510 | 0.01 | 7049.91 | 1214.24 | 400 | 6200 | 7100 | 8000 | 8880 | ▁▁▂▇▇ |
How has the rate of expedition success and member death changed over time?
plot1 <- members %>%
group_by(year = 10 * (year %/% 10)) %>%
summarise(
died = mean(died),
success = mean(success)
) %>%
pivot_longer(died:success, names_to = "outcome", values_to = "percent") %>%
ggplot(aes(year, percent, color = outcome)) +
geom_line(alpha = 0.7, size = 1.5) +
scale_y_continuous(labels = scales::percent_format()) +
labs(x = "year", y = "% of expedition members", color = NULL)
plotly::ggplotly(plot1)

Is there a relationship between the expedition member’s age and success of the expedition or death? We can use the same code but just switch out year for age.
plot2 <- members %>%
group_by(age = 10 * (age %/% 10)) %>%
summarise(
died = mean(died),
success = mean(success)
) %>%
pivot_longer(died:success, names_to = "outcome", values_to = "percent") %>%
ggplot(aes(age, percent, color = outcome)) +
geom_line(alpha = 0.7, size = 1.5) +
scale_y_continuous(labels = scales::percent_format()) +
labs(x = "age", y = "% of expedition members", color = NULL)
plotly::ggplotly(plot2)

Are people more likely to die on unsuccessful expeditions?
members %>%
count(success, died) %>%
group_by(success) %>%
mutate(percent = scales::percent(n / sum(n))) %>%
kable(
col.names = c("Expedition success", "Died", "Number of people", "% of people"),
align = "llrr"
)

| Expedition success | Died | Number of people | % of people |
|---|---|---|---|
| FALSE | FALSE | 46452 | 98% |
| FALSE | TRUE | 868 | 2% |
| TRUE | FALSE | 28961 | 99% |
| TRUE | TRUE | 238 | 1% |
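As a quick check that this gap is larger than chance alone would produce, we could run a two-sample test of proportions on these counts (a sketch that is not part of the original analysis; the counts are taken from the table above):

```r
# Deaths out of total members, for unsuccessful vs. successful expeditions
# (counts taken from the table above)
deaths <- c(868, 238)
totals <- c(868 + 46452, 238 + 28961)

# Two-sample proportion test: is the death rate different
# between unsuccessful and successful expeditions?
prop.test(deaths, totals)
```

With samples this large, even a one-percentage-point difference in death rates is very unlikely to be due to sampling noise.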
We can use a similar approach to see how different the rates of death are on different peaks in the Himalayas.
members %>%
filter(!is.na(peak_name)) %>%
mutate(peak_name = fct_lump(peak_name, prop = 0.05)) %>%
count(peak_name, died) %>%
group_by(peak_name) %>%
mutate(percent = scales::percent(n / sum(n))) %>%
kable(
col.names = c("Peak", "Died", "Number of people", "% of people"),
align = "llrr"
)

| Peak | Died | Number of people | % of people |
|---|---|---|---|
| Ama Dablam | FALSE | 8374 | 100% |
| Ama Dablam | TRUE | 32 | 0% |
| Cho Oyu | FALSE | 8838 | 99% |
| Cho Oyu | TRUE | 52 | 1% |
| Everest | FALSE | 21507 | 99% |
| Everest | TRUE | 306 | 1% |
| Manaslu | FALSE | 4508 | 98% |
| Manaslu | TRUE | 85 | 2% |
| Other | FALSE | 32171 | 98% |
| Other | TRUE | 631 | 2% |
Let’s make one last exploratory plot and look at seasons. How much difference is there in survival across the four seasons?
plot3 <- members %>%
filter(season != "Unknown") %>%
count(season, died) %>%
group_by(season) %>%
mutate(
percent = n / sum(n),
died = case_when(
died ~ "Died",
TRUE ~ "Did not die"
)
) %>%
ggplot(aes(season, percent, fill = season)) +
geom_col(alpha = 0.8, position = "dodge", show.legend = FALSE) +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~died, scales = "free") +
labs(x = NULL, y = "% of expedition members")
plotly::ggplotly(plot3)

Let’s now create the dataset that we’ll use for modeling by filtering on some of the variables and transforming some variables to be factors. There are still many NA values for age, but we are going to impute those.
members_df <- members %>%
filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
mutate(died = case_when(
died ~ "died",
TRUE ~ "survived"
)) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
members_df

# A tibble: 76,507 × 9
peak_id year season sex age citizenship hired success died
<fct> <dbl> <fct> <fct> <dbl> <fct> <int> <int> <fct>
1 AMAD 1978 Autumn M 40 France 0 0 survived
2 AMAD 1978 Autumn M 41 France 0 0 survived
3 AMAD 1978 Autumn M 27 France 0 0 survived
4 AMAD 1978 Autumn M 40 France 0 0 survived
5 AMAD 1978 Autumn M 34 France 0 0 survived
6 AMAD 1978 Autumn M 25 France 0 0 survived
7 AMAD 1978 Autumn M 41 France 0 0 survived
8 AMAD 1978 Autumn M 29 France 0 0 survived
9 AMAD 1979 Spring M 35 USA 0 0 survived
10 AMAD 1979 Spring M 37 W Germany 0 1 survived
# ℹ 76,497 more rows
We’ll split or spend our data to generate training and testing sets.
set.seed(111)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)

We use resampling to evaluate model performance, so let’s get those resampled sets ready.
set.seed(111)
members_folds <- vfold_cv(members_train, strata = died)
members_folds

# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [51642/5738]> Fold01
2 <split [51642/5738]> Fold02
3 <split [51642/5738]> Fold03
4 <split [51642/5738]> Fold04
5 <split [51642/5738]> Fold05
6 <split [51642/5738]> Fold06
7 <split [51642/5738]> Fold07
8 <split [51642/5738]> Fold08
9 <split [51642/5738]> Fold09
10 <split [51642/5738]> Fold10
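If we want to peek inside one of these resamples, rsample provides the analysis() and assessment() accessors. A quick sketch (not part of the original analysis):

```r
first_fold <- members_folds$splits[[1]]

# The analysis set is used for fitting, the assessment set for evaluation
dim(analysis(first_fold))
dim(assessment(first_fold))

# Stratifying on `died` keeps the class proportions similar in each set
analysis(first_fold) %>% count(died)
assessment(first_fold) %>% count(died)
```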
Next we build a recipe for data preprocessing.
First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
Next, we impute the missing values for age using the median age in the training data set. There are more complex steps available for imputation, but we’ll stick with a straightforward option here.
Next, we use step_other() to collapse categorical levels for peak and citizenship. Before this step, there were hundreds of values in each variable.
After this, we can create indicator variables for the non-numeric, categorical values, except for the outcome died which we need to keep as a factor.
Finally, there are many more people who survived their expedition than who died (thankfully) so we will use step_smote() to balance the classes.
The object members_rec is a recipe that has not been trained on data yet (for example, which categorical levels should be collapsed has not been calculated).
library(themis)
members_rec <- recipe(died ~ ., data = members_train) %>%
step_impute_median(age) %>%
step_other(peak_id, citizenship) %>%
step_dummy(all_nominal(), -died) %>%
step_smote(died)
members_rec

Let’s compare two different models: a logistic regression model and a random forest model.
glm_spec <- logistic_reg() %>%
set_engine("glm")
glm_spec

Logistic Regression Model Specification (classification)
Computational engine: glm
rf_spec <- rand_forest(trees = 1000) %>%
set_mode("classification") %>%
set_engine("ranger")
rf_spec

Random Forest Model Specification (classification)
Main Arguments:
trees = 1000
Computational engine: ranger
Putting together tidymodels workflow()
members_wf <- workflow() %>%
add_recipe(members_rec)
members_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_median()
• step_other()
• step_dummy()
• step_smote()
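Before fitting, we could sanity-check what the trained recipe produces. This sketch (not in the original post) preps the recipe on the training data and bakes it to confirm that SMOTE has balanced the classes:

```r
members_rec %>%
  prep() %>%                   # estimate the recipe steps on the training data
  bake(new_data = NULL) %>%    # NULL returns the processed training data
  count(died)
# After step_smote(), the two classes should have equal counts
```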
Now we can add a model and then fit to each of the resamples. First, we fit the logistic regression model, setting a non-default metric set so we can add sensitivity and specificity.
members_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)
glm_rs <- members_wf %>%
add_model(glm_spec) %>%
fit_resamples(
resamples = members_folds,
metrics = members_metrics,
control = control_resamples(save_pred = TRUE)
)
glm_rs

# Resampling results
# 10-fold cross-validation using stratification
# A tibble: 10 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [51642/5738]> Fold01 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
2 <split [51642/5738]> Fold02 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
3 <split [51642/5738]> Fold03 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
4 <split [51642/5738]> Fold04 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
5 <split [51642/5738]> Fold05 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
6 <split [51642/5738]> Fold06 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
7 <split [51642/5738]> Fold07 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
8 <split [51642/5738]> Fold08 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
9 <split [51642/5738]> Fold09 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
10 <split [51642/5738]> Fold10 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
Now fitting the random forest model.
rf_rs <- members_wf %>%
add_model(rf_spec) %>%
fit_resamples(
resamples = members_folds,
metrics = members_metrics,
control = control_resamples(save_pred = TRUE)
)

rf_rs

# Resampling results
# 10-fold cross-validation using stratification
# A tibble: 10 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [51642/5738]> Fold01 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
2 <split [51642/5738]> Fold02 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
3 <split [51642/5738]> Fold03 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
4 <split [51642/5738]> Fold04 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
5 <split [51642/5738]> Fold05 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
6 <split [51642/5738]> Fold06 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
7 <split [51642/5738]> Fold07 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
8 <split [51642/5738]> Fold08 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
9 <split [51642/5738]> Fold09 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
10 <split [51642/5738]> Fold10 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
We have fit each of our candidate models to our resampled training set!
Let’s evaluate our models.
collect_metrics(glm_rs)

# A tibble: 4 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.615 10 0.00264 Preprocessor1_Model1
2 roc_auc binary 0.703 10 0.0117 Preprocessor1_Model1
3 sensitivity binary 0.684 10 0.0158 Preprocessor1_Model1
4 specificity binary 0.614 10 0.00267 Preprocessor1_Model1
Well, this is middling but at least mostly consistent for the positive and negative classes. The function collect_metrics() extracts and formats the .metrics column from resampling results like the ones we have here.
collect_metrics(rf_rs)

# A tibble: 4 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.974 10 0.000757 Preprocessor1_Model1
2 roc_auc binary 0.756 10 0.0131 Preprocessor1_Model1
3 sensitivity binary 0.152 10 0.00992 Preprocessor1_Model1
4 specificity binary 0.986 10 0.000690 Preprocessor1_Model1
The accuracy is great, but the sensitivity is far worse than the logistic regression model’s. The random forest has not learned to recognize both classes well, even with our oversampling strategy. Let’s dig deeper into how these models are doing. For example, how are they predicting the two classes?
glm_rs %>%
conf_mat_resampled()

# A tibble: 4 × 3
Prediction Truth Freq
<fct> <fct> <dbl>
1 died died 56
2 died survived 2182
3 survived died 25.9
4 survived survived 3474.
rf_rs %>%
conf_mat_resampled()

# A tibble: 4 × 3
Prediction Truth Freq
<fct> <fct> <dbl>
1 died died 12.5
2 died survived 80.3
3 survived died 69.4
4 survived survived 5576.
The random forest model is quite bad at identifying which expedition members died, while the logistic regression model does about the same for both classes.
plot5 <- glm_rs %>%
collect_predictions() %>%
group_by(id) %>%
roc_curve(died, .pred_died) %>%
ggplot(aes(1 - specificity, sensitivity, color = id)) +
geom_abline(lty = 2, color = "gray80", size = 1.5) +
geom_path(show.legend = FALSE, alpha = 0.6, size = 1.2) +
coord_equal()
plotly::ggplotly(plot5)

It is finally time for us to return to the testing set. Notice that we have not used the testing set yet during this whole analysis; to compare and assess models we used resamples of the training set. Let’s fit one more time to the training data and evaluate on the testing data using the function last_fit().
members_final <- members_wf %>%
add_model(glm_spec) %>%
last_fit(members_split)
members_final

# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [57380/19127]> train/test sp… <tibble> <tibble> <tibble> <workflow>
The metrics and predictions here are on the testing data.
collect_metrics(members_final)

# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.615 Preprocessor1_Model1
2 roc_auc binary 0.689 Preprocessor1_Model1
collect_predictions(members_final) %>%
  conf_mat(died, .pred_class)

          Truth
Prediction died survived
Prediction died survived
died 191 7273
survived 96 11567
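From this confusion matrix we can compute sensitivity and specificity on the test set directly (a quick arithmetic sketch; yardstick’s sens() and spec() on the predictions would give the same numbers):

```r
# Counts from the test-set confusion matrix above
sensitivity_test <- 191 / (191 + 96)        # correctly predicted deaths, ~0.67
specificity_test <- 11567 / (11567 + 7273)  # correctly predicted survivors, ~0.61

sensitivity_test
specificity_test
```

Both values are in line with the resampled estimates, which suggests we did not overfit during model development.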
The coefficients (which we can get out using tidy()) have been estimated using the training data. If we use exponentiate = TRUE, we have odds ratios.
members_final %>%
pull(.workflow) %>%
pluck(1) %>%
tidy(exponentiate = TRUE) %>%
arrange(estimate) %>%
kable(digits = 3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.000 | 0.938 | -53.347 | 0.000 |
| peak_id_MANA | 0.263 | 0.041 | -32.916 | 0.000 |
| peak_id_other | 0.266 | 0.033 | -40.502 | 0.000 |
| peak_id_EVER | 0.362 | 0.035 | -29.227 | 0.000 |
| sex_M | 0.460 | 0.030 | -25.612 | 0.000 |
| hired | 0.488 | 0.056 | -12.923 | 0.000 |
| citizenship_other | 0.587 | 0.032 | -16.847 | 0.000 |
| citizenship_Japan | 0.706 | 0.037 | -9.309 | 0.000 |
| citizenship_Nepal | 0.875 | 0.063 | -2.133 | 0.033 |
| season_Winter | 0.884 | 0.042 | -2.940 | 0.003 |
| season_Spring | 0.951 | 0.016 | -3.231 | 0.001 |
| age | 0.991 | 0.001 | -12.958 | 0.000 |
| year | 1.027 | 0.000 | 55.770 | 0.000 |
| peak_id_CHOY | 1.101 | 0.041 | 2.339 | 0.019 |
| citizenship_UK | 1.773 | 0.045 | 12.780 | 0.000 |
| citizenship_USA | 1.815 | 0.044 | 13.604 | 0.000 |
| success | 2.184 | 0.016 | 48.953 | 0.000 |
| season_Summer | 3.689 | 0.080 | 16.289 | 0.000 |
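To make one of these odds ratios concrete: an estimate of about 2.18 for success means the odds of the outcome are roughly doubled on successful expeditions, holding the other predictors fixed (the direction depends on which factor level is treated as the event). A small sketch (not in the original post, with an illustrative baseline probability) converting odds to probabilities:

```r
# Helpers to move between probabilities and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

baseline_prob <- 0.985   # illustrative baseline survival probability
or_success <- 2.184      # odds ratio for `success` from the table above

new_odds <- prob_to_odds(baseline_prob) * or_success
odds_to_prob(new_odds)
# With a high baseline, even doubled odds move the probability only
# slightly (from 98.5% to about 99.3%)
```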
plot4 <- members_final %>%
pull(.workflow) %>%
pluck(1) %>%
tidy() %>%
filter(term != "(Intercept)") %>%
ggplot(aes(estimate, fct_reorder(term, estimate))) +
geom_vline(xintercept = 0, color = "gray7", lty = 2, size = 1.2) +
geom_errorbar(aes(
xmin = estimate - std.error,
xmax = estimate + std.error
),
width = .2, color = "gray7", alpha = 0.7
) +
geom_point(size = 2, color = "darkcyan") +
  labs(y = NULL, x = "Coefficient from logistic regression")
plotly::ggplotly(plot4)

The features with coefficients on the positive side (like climbing in summer, being on a successful expedition, or being from the UK or US) are associated with surviving.
The features with coefficients on the negative side (like climbing specific peaks including Everest, being one of the hired members of an expedition, or being a man) are associated with dying.
We need to interpret these coefficients in the context of the model’s moderate predictive accuracy; the model likely does not capture all the factors influencing expedition survival. Additionally, the results indicate a heightened risk for native Sherpa climbers hired as expedition members, underscoring the dangers this group faces during mountain expeditions.
Further readings:
1) Himalayan Climbing Expeditions modelling.
2) Online platform to learn tidymodels.
3) Tidymodels in R book.